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Abstract 

This paper considers the noisy sparse phase retrieval problem: recovering a sparse signal 
X G from noisy quadratic measurements yj = (a'a;)^ + Cj, j = 1,... ,m, with indepen¬ 
dent sub-exponential noise Cj. The goals are to understand the effect of the sparsity of x 
on the estimation precision and to construct a computationally feasible estimator to achieve 
the optimal rates. Inspired by the Wirtinger Flow [12] proposed for noiseless and non-sparse 
phase retrieval, a novel thresholded gradient descent algorithm is proposed and it is shown 
to adaptively achieve the minimax optimal rates of convergence over a wide range of sparsity 
levels when the a^ ’s are independent standard Gaussian random vectors, provided that the 
sample size is sufficiently large compared to the sparsity of x. 

Keywords: High-dimensional M-estimation; Iterative thresholding; Minimax rate; Non- 
convex empirical risk; Phase retrieval; Sparse recovery; Thresholded gradient method. 


1 Introduction 

In a range of fields in science and engineering, researchers face the problem of recovering a p- 
dimensional signal of interest x by probing the signal via a set of p-dimensional sensing vectors 
ttj, j = 1,..., m, and hence the observations are the {a'-xYs contaminated with noise. This gives 
rise to the linear regression model in statistical terminology where x is the regression coefficient 
vector and A = [ai,..., is the design matrix. There is an extensive literature on the theory 
and methods for the estimation/recovery of x under such a linear model. However, in many 
important applications, including X-ray crystallography, microscopy, astronomy, diffraction and 
array imaging, interferometry, and quantum information, it is sometimes impossible to observe 
a'jX directly and the measurements that one is able to obtain are the magnitude/energy of a'jX 
contaminated with noise. In other words, the observations are generated by the following phase 
retrieval model: 

yj = \a'jx\^ + ej, j = (1.1) 
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where e = (ei,... ,6^)^ is a vector of stochastic noise with Ee = 0. Note that E(yj) = \a'-x\^, 
so in the real case, (1.1) can be treated as a generalized linear model with the multi-value link 
function g{z) := We refer interested readers to [41] and the reference therein for more 

detailed discussions on scientific and engineering background for this model. 

In many applications, especially those related to imaging, the signal a; G admits a sparse 
representation under some known and deterministic linear transformation. Without loss of gener¬ 
ality, we assume in the rest of the paper that such a linear transform has already taken place and 
hence the signal x is sparse itself. In this case, model (1.1) is referred to as the sparse phase re¬ 
trieval model. In addition, we consider the case where e are independent centered sub-exponential 
random errors. This is motivated by the observation that in the application settings where model 
(1.1) is appropriate, especially in optics, heavy-tailed noise may arise due to photon counting. 

Efficient computational methods for phase retrieval have been proposed in the community of 
optics, and they are mostly based on the seminal work by Gerchberg, Saxton, and Fienup [21, 19]. 
The effectiveness of these methods relies on careful exploration of prior information of the signal 
in the spatial domain. Moreover, these methods were revealed later as non-convex successive 
projection algorithms [30, 4]. This provides insight for occasional observation of stagnation of 
iterates and failure of convergence. 

Recently, inspired by multiple illumination, novel computational methods were proposed for 
phase retrieval without exploring and employing a priori information of the signal. These methods 
include semidefinite programming [14, 10, 11, 44, 13], polarization [2], alternating minimization 
[37], gradient methods [12], alternating projection [35], etc. More importantly, profound and 
remarkable theoretical guarantees for these methods have also been established. As for noiseless 
sparse phase retrieval, semidefinite programming has been proven to be effective with theoret¬ 
ical guarantees [31, 38, 22]. Other empirical methods for sparse phase retrieval include belief 
propagation [39] and greedy methods [40]. 

Regarding noisy phase retrieval, some stability results have been established in the literature; 
See [9, 42, 15]. In particular, stability results have been established in [16] for noisy sparse phase 
retrieval by semidefinite programming, though the authors did not study the optimal dependence 
of the convergence rates on the sparsity of the signal and the sample size. Nearly minimax 
convergence rates for sparse phase retrieval with Gaussian noise have been established in [28] 
under sub-gaussian design matrices. However, the optimal rates are achieved by empirical risk 
minimization under sparsity constraints, in which both the objective function and the constraint 
are non-convex, implying that the procedure is not computationally feasible. 

In the present paper, we establish the minimax optimal rates of convergence for noisy sparse 
phase retrieval under sub-exponential noise, and propose a novel thresholded gradient descent 
method in order to estimate the signal x under the model (1.1). For conciseness, we focus on 
the case where the signal and the sensing vectors are all real-valued, and the key ideas extend 
naturally to the complex case. The theoretical analysis sheds light on the effects of the sparsity 
of the signal x and the presence of sub-exponential noise on the minimax rates for the estimation 
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of X under the (.2 loss, as long as the sensing vectors a^-’s are independent standard Gaussian 
vectors. Combining the minimax upper and lower bounds given in Section 3, the optimal rate of 
convergence for estimating the signal x under the £2 loss is k is the sparsity 

of X, II • II 2 is the usual Euclidean norm, and a characterizes the noise level. Moreover, it is 
shown that the thresholded gradient descent procedure is both rate-optimal and computationally 
efficient, and the sample size requirement matches the state-of-the-art result in computational 
sparse phase retrieval under structureless Gaussian design matrices. 

We explain some notation used throughout the paper. For any n-dimensional vector v = 
(ui,..., Vn)' and a subset S' C {1,..., n}, we denote by vs the n-dimensional vector by keeping 
the coordinates of v with indices in S unchanged, while changing all other components to zero. 
We also denote ||i>||g := {vf for q > 1, and ||i>||oo = maxi<fc<n \vk\- Also denote 

||t)||o as the number of nonzero components of v. For any matrix M G and any subsets 

Si G {1,..., ni} and S '2 G {1,..., n 2 }, MsiS 2 ^ is defined by keeping the submatrix of M 

with row index set S'! and column index set S' 2 , while changing all other entries to zero. For any 
> 1 and (72 > 1, we denote ||iW||q 2 _>gj the induced norm from the Banach space || . 

II • llqj). For simplicity, denote ||iW|| := ||M'|| 2 ->. 2 - We also denote by the n x n identity 
matrix. 

The rest of the paper is organized as follows: In Section 2, we introduce in detail the thresh¬ 
olded gradient descent procedure, which consists of two steps. The first is an initialization step by 
applying a diagonal thresholding method to a matrix constructed with available data. The second 
step applies iterative thresholding procedure for the recovery of the sparse vector x. Section 3 
establishes the minimax optimal rates of convergence for noisy sparse phase retrieval under the 
£2 loss. The results show that the proposed thresholded gradient descent method is rate-optimal. 
In Section 4, numerical simulations illustrate the effectiveness of thresholding in denoising, and 
demonstrate how the relative estimation error depends on the thresholding parameter /3, sample 
size m, sparsity k, and the noise-to-signal ratio iT/||a;|||. In Section 5, we discuss the connections 
between our thresholded gradient method for noisy sparse phase retrieval and related methods 
proposed in the literature for high-dimensional regression. The proofs are given in Section 6 with 
some technical details deferred to the appendix. 

2 Methodology 

The major component of the our method is a thresholded gradient descent algorithm to obtain a 
sparse solution to a given non-convex empirical risk minimization problem. Due to the non-convex 
nature of the problem, in order to avoid any local optimum that is far away from the truth, the 
initialization step is crucial. Thus, we also provide a candidate method which can be justified 
theoretically for yielding a good initializer. The methodology is proposed assuming that A has 
standard Gaussian entries, though it could potentially also be used when such an assumption does 
not necessarily hold. 
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2.1 Thresholded Wirtinger flow 

Given the sensing vectors aj and the noisy magnitude measurements yj as in (1.1) for j = 1, ... ,m, 
one can consider estimating x by minimizing the following empirical risk function 

^ m 

(2.1) 

i=i 

Statistically speaking, in the low-dimensional setup with hxed p and m —)■ oo, if the additive 
noises are heavy-tailed, least-absolute-deviations (LAD) methods might be more robust than 
least-squares methods. However, recent progress in modern linear regression analysis shows that 
least-squares could be preferable to LAD when p and m are proportional, even the noises are sub¬ 
exponential [18]. Due to this surprising phenomenon, we simply take the least-squares empirical 
risk in (2.1), although phase retrieval is a nonlinear regression problem, which could be very 
different from linear regression. More importantly, close-form gradient methods can be induced 
from the empirical risk function in (2.1), which is computationally convenient. To be specific, at 
any current value of z, one updates the estimator by taking a step along the gradient direction 

. m 

V/(2) = —Y^ {\ajZ\‘^ - yj) {a'^z)aj (2.2) 

^ j=i 

until a stationary point is reached. Indeed, Candes et al. [12] showed that under appropriate con¬ 
ditions, initialized by an appropriate spectral method, a gradient method, referred to as Wirtinger 
flow, leads to accurate recovery of a; up to a global phase in the complex domain and noiseless 
setting. 

However, the direct application of gradient descent is not ideal for noisy sparse phase retrieval 
since it does not utilize the knowledge that the true signal x is sparse in order to mitigate the 
contamination of the noise. To incorporate this a priori knowledge, it makes sense to seek a 
“sparse minimizer” of (2.1). To this end, suppose we have a sparse initial guess x^^'^ for x. To 
update x^^') to another sparse vector, we may take a step along Vf{x^^'>), and then sparsify the 
result by thresholding. 

Indeed, if we were given the oracle knowledge of the support S of x, then we can reduce the 
problem to recovering xs based on the {yj, ajs}JLi- avoiding estimating any coordinate of x 
in we could greatly reduce variance of the resulting estimator of a;. In reality, we do not have 
such oracle knowledge and the additional thresholding step added on top of gradient descent is 
intended to mimic the oracle behavior by hopefully restricting all the updated coordinates on S. 

Let Tr be any thresholding function satisfying 

Tt{x) = 0, Va:G[—r,r], and \Tt{x) — x\ < t, Vx G M. (2.5) 

For any vector b = (6 i,..., bpY, let T)(h) = (T)(6 i),..., T)(6p))'. With the foregoing definition, 
the proposed thresholded gradient descent method can be summarized as Algorithm 1 . In view of 
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Algorithm 1: Thresholded Wirtinger flow for noisy sparse phase retrieval 
Input: Data {o-j,yj}^i] initial estimator xq] thresholding function 7~; gradient tuning 
parameter //; thresholding tuning parameter /3; number of iterations T. 
Output: Final estimator x. 

1 Initialize n 0 and x^^^ = xq. 
repeat 

Compute threshold level 






(3 log(mp) 




(11 2 

^ |a'.®W|2 ; (2.3) 


t=i 


Update 


:= T^,(SW) , (2.4) 


until n = T; 

where V/ is defined in (2.2); 
4 Return x = . 


the Wirtinger flow method for noiseless phase retrieval [12], we name our approach the “Thresh¬ 
olded Wirtinger Flow” method. The data-driven choice of the threshold level in (2.3) is motivated 
by the following intuition. Recall that we assume the sensing vectors {aj : j = 1,... ,m} are in¬ 
dependent standard Gaussian vectors. For a fixed z, if we act as if each (|a'zp — yj)(a'z) is a 
fixed constant, then the gradient in (2.2) is a linear combination of Gaussian vectors and hence 
has i.i.d. Gaussian entries with mean zero and variance ^ ~ Therefore, 

the threshold t(z) is simply (3 log(mp) times the standard deviation of these Gaussian random 
variables, which is essentially the universal thresholding in the Gaussian sequence model literature 
[24]. Although the above intuition is not exactly true, the resulting thresholds in (2.3) are indeed 
the right choices as justified later in Section 3, and illustrated in Section 4. Notice that there are 
two tuning parameters /i and (3, which should be treated as absolute constants. We will validate 
some theoretical choices and also provide practical choices later. 

2.2 Initialization 

It is worth noting that the success of Algorithm 1 depends crucially on the initial estimator for 
two reasons. First, the empirical risk (2.1) is a non-convex function of z and hence it could 
have multiple local minimizers. Hence the success of a gradient descent based approach depends 
naturally on the starting point. Moreover, an accurate initializer can reduce the required number 
of iterations in the thresholded Wirtinger flow algorithm. In view of its crucial rule, we propose 
in Algorithm 2 an initialization method which can be proven to yield a decent starting point for 
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Algorithm 2: Initialization for Algorithm 1 
Input: Data tuning parameter a. 

Output: Initial estimator Xq. 

1 Compute 



j=i 


and 

^ m 

i=i 

2 Select a set of coordinates 

3 Compute a p X p matrix 

^ m 

^SoSo 

i=l 

4 Return 


*0 = 


where vi as the leading eigenvector of ^ 5 ^ 5 ^- 


( 2 . 6 ) 


(2.7) 


( 2 . 8 ) 


(2.9) 


( 2 . 10 ) 


Algorithm 1 under our modeling assumption. 

The motivation of the algorithm is similar to that of diagonal thresholding [25] for sparse 
PCA: we want to identify a small collection of coordinates with big marginal signals and then 
compute an estimator of x by focusing only on these coordinates. In particular, the quantity 
Ii in (2.7) captures the marginal signal strength of the /-th coordinate and Sq (2.8) selects all 
coordinates with big marginal signals. Last but not least, (2.9) and (2.10) computes the initial 
estimator by focusing only on the coordinates in Sq. There is a tuning parameter a needed as 
input of the algorithm, which can be treated as an absolute constant. We will provide some 
justified theoretical choice later. 

3 Theory 

We first establish the statistical convergence rate for the thresholded Wirtinger flow method un¬ 
der the case of “Gaussian design”, i.e., aj J\f{0,lp) for j = l,...,m in (1.1). Moreover, 
we assume the signal x is A:-sparse, i.e., ||a;||o = k, and the noises ei,...,em are m indepen¬ 
dent centered sub-exponential random variables with maximum sub-exponential norm a, i.e.. 
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a := maxi<j<m Here for any random variable X, its sub-exponential norm is defined 

as ||^||?/)i := supp>;^ p“^(E |X|P)j>. This definition, as well as some fundamental properties of 
sub-exponential variables (such as Bernstein inequality), can be found in Section 5.2.4 of [43]. 


Theorem 3.1 Suppose (3 = 4 in (2.3), and a = K (2-8) for some absolute constant 


K. 


\ II HZ/ 

Suppose pL < pLo in (2.4) and m > C (^1 + log(mp). For all t = 1, 2,3,..., there holds 


sup ^{A,y\x) m 


® 0 


=k 


mm kc 
i=0,l 


W _ V 


(- 1)*®||2 > 


6 V 16/ ll^lh \ m J 


46 10 

< — + 


m e^ mp^ 


where fio, C, and Cq are some absolute constants. 


The proof is given in Section 6. Lemma 6.3 guarantees the efficacy of the initialization step 
Algorithm 2, and Lemmas 6.4 and 6.5 explain why the thresholded Wirtinger flow method leads 
to accurate estimation. Here P = 4 and a = AT ^1 -|- ure chosen for analytical convenience. 

The discussion of empirical choices of /3, a, and fi are deferred to Section 4. 

Let us interpret Theorem 3.1 by considering the following cases. In the noiseless case, with high 
probability, we obtain min — (—l)®a ;||2 < g (l — ||®|| 2 - This implies that thresholded 

gradient descent method leads to linear convergence to the original signal up to a global sign. 

In the noisy case, if /x > 0 is an absolute constant, by letting t x log (1/5) where 6 = 
j|T|p obtain min — (—l)®a ;||2 A with high probability. If the knowl¬ 

edge of 6 is not available, by choosing t = O(logp), we can obtain min — (—l)*a ;||2 A 

i=0,l 


a_ ^ for any predetermined c > 0. The convergence rate tt^a/ is better than 

the upper bound result established in [28], which is achieved by the intractable sparsity con¬ 
strained empirical risk minimization. Our contribution is to show that this rate can be obtained 
tractably by a fast algorithm. 

Ignoring any polylog factor, the above convenient properties of thresholded Wirtinger flow are 
guaranteed by the sample size condition m > When m <C p, this condition is crucial for the 


effectiveness of initialization Algorithm 2. An immediate question is whether such a minimum 
sample size condition is in some sense necessary for any computationally efficient algorithm, if the 
sensing matrix is random and structureless? A similar phenomenon has been previously observed 
in the related but different problem of sparse principal component analysis. Assuming the hardness 
of the planted clique problem [3], a series of papers [6, 45, 20] have shown that a comparable 
minimum sample size condition is necessary for any estimator computable in polynomial time 
complexity to achieve consistency and optimal convergence rates uniformly over a parameter 
space of interest. In particular, it was shown in [20] that this is the case even for the most 
restrictive parameter space in sparse principal component analysis - (discretized) Gaussian single 
spiked model with a sparse leading eigenvector. Establishing comparable computational lower 
bounds for sparse phase retrieval, especially under the Gaussian design, is an interesting project 
for future research. 
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In the case when m > p ignoring any log factor, it is well-known that a consistent initializer 
can be obtained by spectral methods [37, 12], no matter whether x is sparse or not. In other 
words, the diagonal thresholding idea in Algorithm 2 is not as crucial as in the case m <C p. It 
is interesting to investigate whether m> k'^ can be relaxed such that the optimal converge rates 
can still be achieved by thresholded Wirtinger flow. 

The convergence rate n^jj^ essentially optimal. The following lower bound result has 

been essentially proven in [28]: 

Theorem 3.2 ([28]) Let Q{k,p,R) = {a; G : lla;jj 2 = i?, JJtcjjo = k}. Suppose the aj’s are 
i.i.d. the ej’s are i.i.d. AA(0,cj^), and they are mutually independent. There holds under 

model (1.1), 

inf sup ^{A,y\x) min P - (-1)*®||2 > 

® xee{k,p,R) Y*=o,i K 

provided m>C + 1^ k\og{ep/k), where both C and Co are some absolute eonstants. 

Notice that for a standard Gaussian variable with variance its sub-exponential norm is 
a constant multiple of a. For brevity, we do not scale the Gaussian noises such that their sub¬ 
exponential norms are strictly less than or equal to a. 

4 Numerical Simulation 

In this section, we report numerical simulation results to demonstrate how the relative estimation 
error depends on the thresholding parameter /3, the noise-to-signal ratio (NSR) the sample 

size m, and the sparsity k. To guarantee fair comparison, we always hx the length of the signal 
p = 1000 and the initialization parameter a = 0.1 (except for the first case on thresholding effect). 
Moreover, in each numerical experiment, we conservatively choose gradient parameter p = 0.01, 
and the number of iterations T = 1000 for thresholded Wirtinger flow. The resulting estimator is 
denoted as x = With each fixed k, the support of x is uniformly distributed at random. 

The nonzero entries of x are i.i.d. ~ M{0, 1). The noise e ~ AA(0, a'^Im), where a is determined 
by \\x '^2 and the choice of NSR As discussed before, the design matrix A consists of 

independent standard Gaussian random variables. 

1. Thresholding effect: Fix a = 0.1, m = 7000, k = 100, and iT/jja;jj2 = 1. For each /3 = 
0,0.25,0.5,..., 3, we implement the algorithm for 10 times with independently generated 
A, X, and e. and then take the average of the 10 independent relative errors min(jjS — 
ajjb, P + ®||2)/p||2- The relation between the average relative error and the choice of /3 
is plotted as the red curve in Figure 1. The result shows that the average relative error 
essentially decreases from 0.2365 to 0.1151 as the thresholding parameter increases from 0 
to 0.75, and then increases slowly up to 0.1684 as fi continues to increase to 3. 


k log{ep/k) 


m 


1 

^ 5 - 








We implement the above experiments again with the only difference a = 0.5. The relation 
curve between the relative estimation error and /? is plotted as the blue curve in Fig. 1. It 
is clear that the performance of the algorithm is very close to the case a = 0.1. 



Figure 1: The relation between the average relative error and the thresholding parameter jS. Setup 
of parameters: p = 1000, m = 1000, k = 100, (t/||®||| = 1, = 0.01, and T = 1000. Red curve 

with a = 0.1, while blue curve with a = 0.5. 

2. Noise effect: Fix m = 7000, k = 100, and /3 = 1. In each choice of NSR (T/||a ;||2 = 
0,0.1,..., 1, with 5 instances of (A, £c, e) generated independently, we take the average of 
the relative error min(||S — x\\2, \\x + a;||2)/||®||2- In Figure 2, it shows how the average 
relative error depends on NSR. The average relative error strictly increases from 0.0000 to 
0.1219, as the NSR increases from 0 to 1. 

0.15 


0.1 





Figure 2: The relation between the average relative error and the noise-to-signal-ratio c’'/||a ;||2 
Setup of parameters: p = 1000, m = 1000, k = 100, /3 = 1, a = 0.1, p = 0.01, and T = 1000. 
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3. Sample size effect: Fix k = 100, (T/||a;||2 = 1, and /? = 1. In each choice of m = 
2000,3000,11000, with 5 instances of {A,x,e) generated independently, we take the 
average of the relative error min(||S — £c||2, \\x + a;||2 )/||a3 ||2- In Figure 3, it shows how the 
average relative error depends on the sample size. When the sample sizes are 2000 and 
3000, i.e., twice and three times as large as p, the average relative errors are 0.8444 and 
0.3651 respectively. In these cases, the thresholded gradient descent method leads to poor 
recovery of the original signal. When the sample size increases from 4000 to 11000, the 
average relative error decreases steadily from 0.1692 to 0.0956. 



m 


Figure 3: The relation between the average relative error and the sample size m. Setup of 
parameters: p = 1000, cr/||a;||2 = 1, k = 100, (3 = 1, a = 0.1, p = 0.01, and T = 1000. 

4. Sparsity effect: Fix m = 7000, o'/||®||2 = 1) and 13 = 1. In each choice of sparsity k = 
25, 50,..., 200, with 10 instances of (A, x, e) generated independently, we take the average 
of the relative error min(||® — ®||2, \\x + a;||2)/||a3||2- Figure 4 demonstrates the relation 
between the average relative error and the sparsity. The average relative error essentially 
increases from 0.1059 to 0.1666, as the sparsity increases from 25 to 200. 

5 Discussion 

In this paper, we established the optimal rates of convergence for noisy sparse phase retrieval 
under the Gaussian design in the presence of sub-exponential noise, provided that the sample size 
is sufficiently large. Furthermore, a thresholded gradient descent method called “Thresholded 
Wirtinger Flow” was introduced and shown to achieve the optimal rates. 

Iterative thresholding has been employed in a variety of problems in high-dimensional statis¬ 
tics, machine learning, and signal processing, under the assumption that the signal or parame¬ 
ter vector/matrix satisfies a sparse or low-rank constraint. Examples include compressed sens¬ 
ing/sparse approximation [17, 36, 34, 7], sparse principal component analysis [33, 48], high- 
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Figure 4: The relation between the average relative error and the sparsity k. Setup of parameters: 
p = 1000, (T/||a;||2 = 1, m = 7000, /3 = 1, a = 0.1, p = 0.01, and T = 1000. 

dimensional regression [1, 47, 23], and low-rank recovery [8, 26, 29]. 

Regarding the application of iterative thresholding and projected gradient methods in high¬ 
dimensional M-estimation, their statistical optimality has been established when the empirical 
risk function satisfies certain properties, such as restrictive strong convexity and smoothness (RSC 
and RSS) [1, 47, 23]. Although our thresholded gradient method aims to solve (2.1) for a sparse 
solution, the existing analytical framework for high-dimensional M-estimation does not apply to 
the sparse phase retrieval problem, since the empirical risk function in (2.1) does not satisfy RSC 
in general, no matter how large the sample size is. Instead, we have shown that thresholded 
gradient methods can achieve optimal statistical precision for signal recovery, even when the 
empirical risk function does not satisfy the common assumption of RSC. 

Besides thresholded gradient methods, convexly and non-convexly regularized methods are also 
widely-used for high-dimensional M-estimation. In fact, some iterative thresholding methods are 
induced by regularizations; See, e.g., [17]. Therefore, an alternative candidate method for solving 
the noisy sparse phase retrieval problem is to penalize the empirical risk function in (2.1) before 
taking the minimum, in order to promote a sparse solution. The major difficulty is apparently 
the non-convexity of the empirical risk function. An interesting result in [32] guarantees the 
statistical precision of all local optima, as long as the non-convex penalty satisfies certain regularity 
conditions, and the empirical risk function, possibly non-convex, satisfies the restricted strong 
convexity. A similar result appeared in [46], in which the empirical risk function is required to 
satisfy a sparse eigenvalue (SE) condition. However, back to noisy sparse phase retrieval, the 
empirical risk function in (2.1) satisfies neither RSC nor SE in general, so there is no guarantee 
that all local optima are consistent. A natural question is whether some penalized version of 
(2.1) is strongly convex in a sufficiently large neighborhood of its global minimum, such that a 
tractable initializer lies in this neighborhood provided the sample size is sufficiently large. Another 
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interesting question is whether the global minimizer of such penalized version of (2.1) is a rate- 
optimal estimator of the original sparse signal. We leave these questions for future research. 


6 Proof of Theorem 3.1 


In model (1.1), denote S = supp(a;), which implies |5| = k. Without loss of generality, we assume 
5 = {1,..., k}. As to the Gaussian design matrix A G denote 


A5: = 


“is 


1 

P 

n 

_1 


, A^c := 


1- 

p 

1_ 


Clmgc 


( 6 . 1 ) 


both of which are in 

For any two two random variables/vectors/matrices/sets X and Y, we denote by A II y if X 
and y are independent. 


Lemma 6.1 From the model (1.1), we have y II Age. Moreover, we have {/i,.... 1^} II Age and 
(p II Age, where </> and {/i,... ,Ik} are defined in (2.6) and (2.7), respectively. 

Proof The fact y = |Aa;p -|- e = \AsXs\^ + e implies straightforwardly that y II Age. By 
(2.7), we know for all Z = 1,... ,k, are defined by y and As, which implies that //_lLA 5 c for 
all Z = 1,... ,/c. Finally, by (2.6), we know cj) is determined uniquely by y, which implies that 
(pdLAsc. ■ 


Lemma 6.2 On an event Eq with probability at least 1 — 


1 — ( 2 + Cq- 


\x\ 


a \ /logm 


— Ti—TP? ^ 1 T 1 2 -|- Go 


m kc 


\x\ 


a \ /logm 2 logm 


m 


+ 


m 


for some numerical constant Cq >0. As a consequence, as long as > C{5) |^1 -|- there 

holds 

^ <l-d< <1 + 6 
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\x 


10 


Proof By the dehnition of and yj,j = 1,... ,m, we have 


1 


m 


m 


— ^(a'kc)^ + — Cj 


1=1 1=1 
As shown in Lemma A.7, with probability at least 1 


1 

m 


m 



j=l 


< Goer 




logm 

m 
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for some numerical constant Cq > 0. Moreover, since x is fixed, there holds 




\x\ 




By Lemma 4.1 of [27], with probability at least 1 — —, we have 

^ Er=iK*)^ 


1 - 2 


< L ' <1 + 2 


m 


m\\x 


|2 — 


log m ^ 2 log m 


m 


m 


The proof is done. 


Lemma 6.3 Let a = K [1 + 


for some large enough absolute constant K, and x^^'^ be 
defined in Algorithm 2. There exists a random vector satisfying x^^'^ II Age and supp(a;(*^^) C 
S, such that on an event Eqi with probability at least 1 — ^ — 2e“^, we have 


= 2^*^^ and mini II® 


-®||2, 1!®^°^ T^lb) < x||®||2, 
0 


provided m > C ^1 + A:^log(mp). Here C is an absolute constant. 

Proof Recall that S = {1, ... ,k} and Ii = ^ EEi b®?; for / = 1,... ,p. Define 


So = ll e S ■. Ii> 1 + a 


log(mp) 


C 5. 


m 


( 6 . 2 ) 


Since {Ii,... ,lk, <b} II Aye, we have S'odLAsc. Define x^^'> G M.P as the leading eigenvector of 

. m 

with 2 -norm (j). This easily implies supp(®^°^) G So C S. Since {VPgpgg, 6} II Aye, we also have 
®(°)_ 1 LA 5 c. 

To simplify notation, let us write for any j G [m], yj := {a'-xfi = [a'-gxfi, which implies 
Pj = pj + Cj. Notice that 

^ m ^ m 

= yMi -1) + - E - 1)’ (6-3) 

1=1 1=1 

in which we will first control the second term. For a given Z G [p], we know — 1,..., — 1 are 

i.i.d. centered sub-exponential random variables with sub-exponential norms being an absolute 
constant. Then, by Bernstein inequality (see, e.g.. Proposition 16 in [43]), we have with probability 
at least I - 


m 


1=1 


< 


Co (^||e|| 2 A/log(mp) + ||e||oo log(mp)^ 
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for some absolute constant Cq. Then by Lemma A.7, with probability at least 1 — 4/m, we have 


max 

i<i<P 


m "T—' ■' 


i=i 


<Coa{ I (6.4) 


m 


m 


m 


provided m > C'(logp) for some absolute constant C. 

Next, we prove that with high probability It suffices to prove Sq = 5o, i.e., Sq C S. 

For any I E S'^, aji and yj are independent, and so conditional on {yj^j E [m]}, YyjLiVj^ % ^ 

weighted sum of Xi variables. By Lemma 4.1 of [27], 


E- 1) > ( E^ I + ^ ^maxyj ) t 


i=i 


> < exp(—t). 

Moreover, Chebyshev’s inequality, the Gaussian tail bound and the union bound lead to 

IIa;II 2 > 3m + V96mt > < 

maxyj/ll^lli > < 2mexp(—1/2). 

Thus, with probability at least 1 — for alH G 5^, 

If si-la?, - 1) < + (6,5) 

m ’ V m m V m 

i=i 

Here the last inequality holds when m > C for some absolute constant C. 

Since a = K (l + with large enough K, by (6.3), (6.5), (6.4) and Lemma 6.2, we obtain 
that with probability at least 1 — for all I £ 


log(mp) ^ ^^ 2 . /log(mp) 


m 


m 


< ( 8||®||2 + Coer ) 

which implies that Sq C S. 

Next, we prove that — a;|| 2 /||a ;||2 < g with high probability. For any fixed I G S', 

straightforward calculation yields Eyjo|j = ||a;||| + 2xf. On the other hand, 

= 105xf+ 90x/(||a;||i-x/) + 9(||a;|||-xf)2. 

So for Xj = ||a ;||2 + 2xf — yjOji, we have Xj < ||x ||2 + 2x/ < 3 ||a:|| 2 , E Aj = 0 and E A? = 
20x| + 68 ||x|| 2 x/ + 8 ||x ||2 < 96||a:||2. By Lemma A.l, 


m 

E 


: vA - mdl^lli + 2:.?) < -* < exp (- 
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Next, Lemma 4.1 of [27] leads to with probability at least 1 — 


m 


Vj Il®ll2 — ( ^ 


i=i 


logm ^ 21ogm 


m 


m 


1*112 < 2 . 1 ||a ;||2 


log m 


m 


The last two inequalities, together with (6.4) and (6.3), imply that with probability at least 1 — 
for all I G 5, 

/log(mp) 


h-(l> > 2*r - (16||*||2 + Coa) 


m 


Define S- = ^ S : > (ll + |a) |. Then, for all I ^ S- we have 


2^/6 2 , ^11 ii2 ^ N /log(mp) 


> (^a||®||2 + 6||a;||i - Coa) 
5 


m 


Since a = K + [ji^) sufficiently large absolute constant K, by lemma 6.2, we have or all 
/ G S_, 


m 


with probability at least 1 — 9/m. This implies S- C Sq- 


Therefore, we have ||* —* 5 o||| < \\x — xs_ ||| < (11 + 0.6 q;)||*||2 Y^ ^ < (5^||*|||, provided 

that m > C{6) ^1 + /c^log(mp). Notice that 'KW = ||£c|||ip + 2xx', which implies that 

(E VF )55 = ||a;|| 2 (ip )55 + 2xx'. Furthermore, by the definition of VF, we have 

^ m 1 ^ 


m 


i=i i=i 

By Lemma A.6, with probability at least 1 — 1/m, we have 


1 


m 




'J5®l 


's - (ll®ll2(-^p)5S + 2**') 


i=i 


< - * 


2 ) 


provided m > C{5)k\ogp. Moreover, by Lemma A.7 and Lemma A.8, with probability at 
least 1 — 2/m — 2e“^, we have — Coa^^m{k + logm). By assuming m > 

C'(5)||^lpA;log(mp), we have ^ Yl^=i^j^jSO-'js — lll^lli- This implies that 
||FF5o5o-(E^)So5oII < I|Ws5-(E1^)5s|| <<5||*||i. 

It is noteworthy that the leading eigenvector of (EFF )55 with unit norm is *So/ll®S'oIbi the 
eigengap between the leading two eigenvalues of (EVF) 5 q 5 q is 2||*5(j||2. Recall that x^^'> is the 
leading eigenvector with norm cj). Then by the Sin-Theta theorem, 


3,(0) (^( 0 ))T 


(/2 

ll®So II 2 


< 


5\\x 


2||®Soll2 - <5||®|li 2-5(5 
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By Lemma 6.2, we have 1 + (^ > (/>/||a;||2 > 1 — <5. Together with 1 > ||a;5QIb/ll^lb > 1 — 5, we can 
easily obtain that min(||a;(^^ — x\\ 2 , + x\\ 2 ) < C'o<5||ic||2 for some absolute constant Cq. By 

letting 6 be small enough, we have min(||a;(^^ — x\\ 2 i + x\\ 2 ) < l/6||a;||2. 

In conclusion, we have 

P and min(||a;® — a;||2, ||a;® + x\\ 2 ) < l/6||a;||2') > 1 — — — 2e“^. 

V J m 


Lemma 6.4 Define ri{z) = ■ With probability at least 1 — — 4e ^, for 

all z satisfying \\z — x \\2 < g||®||2 and supp(2:) C S, we have 


z) - x \\2 ^ \\z-x \\2 ^ na /fclogp 


® 2 


® 2 


\\X\ 


m 


provided /U < /tq and m > Ck'^logp. Here Cq, C, and are numerieal constants. This implies 
that, on an event Eq 2 with probability at least 1 — ^ — 8e~^, for all jz G satisfying min(||2: — 
2, \\z + £c|| 2) < g||a;||2 and supp(z) C S, we have 


X 


min(||r/(z) - x\\ 2 , ||r?(z) + ic||2) < min(||z - £c||2, ||z + £c||2) + C'o tt^A• 

V 8/ ll^lb V m 

Proof For z supported on S, define 

u = r?(z) = (^z - ^Vf{z)s^ =z- 

where v G supp(u) C S and ||u||oo < 1- 

Since supp(z) C 5 = {1,..., k}, we have 


= -Yl - yj) 


a 


JS- 


i=i 


For convenience, let 


and so 




75’ 


i=i 


V/(2)s - V/(z)5 = --Yl 


( 6 . 6 ) 


(6.7) 


( 6 . 8 ) 


i=i 


Denote h = z — x € M^, which implies supp(h,) C S and ||h||2 < ||a;||2/6. Straightforward 
calculation yields 


\u — x \\2 < 


h-^Vf{z) 


+ 




Vf{z)s-^)s\^ + ^r{z) 


- + T2^2 + -^nz). 


(6.9) 


It suffices to bound Ti, T 2 and t(z). 
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Bound for Ti By simple algebra, we have 


Tl = 


2_^ 

^ (fp'm 


+ ^{aj'gx){a/ghf + (a/ghf) + ^ Vf{z)g 


i=i 


2 rp I M rr-! 

2 ~ 12- 


In what follows, we derive lower bound for T\i and upper bound for T12 separately. 


Notice that 


i=i 

First, by Lemma A.6 with probability at least 1 — 1/m, we have 


1 


m 


y] 2{aj'gx)‘^{aj'gKf > (2 — 25) {2{x'h)‘^ + HajHill^lli) • 


j=i 


By Lemma A.5, with probability at least 1 — 2/m, we have 


m 


i=i Vi=i i=i 

3 ,1 

m 



< —((3m)4 +/c2 + y21ogm)^||®||2||i«.|i2 

< 10||a:||2||/i||i, 

provided m> CkP for some sufficiently large numerical constant C. This implies 
Tn > (2 - 25)\\x\\l\\h\\l - 10||®||2||^||i > (1/3 - 25)\\x\\l\\h\\l 
As to the upper bound for T12, we can find ||m||2 = 1, such that 

2 


ri2 = \\Vf{z)gr2 < -2 




i=i 


By Holder’s inequality and Lemma A.5, we have 


( 6 . 10 ) 


^12 - ^2 IN 1 ( N l“is(^® + I ( N I ( N 

vi=i / Vi=i / Vi=i / Vi=i 




< ^(( 3 m )4 + A:2 + Y/21ogm)®||fi.||2||2a; + fi.||i||a; + fi.||i||m||2 < ^011/111211*112, 


m 


provided m > Ck"^, with sufficiently large constants Cq and C. To summarize, with probability 
at least 1 — "ilm, 


Tl < 


2-^(l/3-2<5)||/r||2||*||2 + Co^||*||^||/i||2. 


( 6 . 11 ) 
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By Lemma 6.2, letting 5 small enough, we have with probability at least 1 — 6/m, 

Ti < {1- /x/8)||h||2, 

provided fJ. < no with sufficiently small absolute constant //q > 0. 


Bound for T 2 Note that 


T2 < —llajlU 

6m ^ 


j=i 


By Lemma A.7 and Lemma A.8, with probability at least 1 — 2/m — 2e we have 

m 

'^ejajsa'js < CoO'Jm{k + logm) 

1=1 

provided mllogm > k. In summary, by Lemma 6.2, we have that with probability at least 
l-5/m-2e"^. 




\x\ 


m 


Bound for t(z) By simple algebra, 
2 /„n /51ogp 


r (z) = 






1=1 


< 


2/3 log p 


m^ 


+ h.)| |aj_g(a; +/i)| + |ajfg,(a; + h)| 


1=1 

^(T.+T.), 


1=1 


By Holder’s inequality and Lemma A.5, with probability at least 1 — 2lm, we have 


71 < \a/shfj \ajs{2x + h)|® j |a/5(® + /i)|^ 

< Co||A5||^^6ll^ll2ll®ll2 < Coim + A:^)||h,|||||a;||2 

for some numerical constant Cq- By Lemma A.7 and Lemma A.8, with probability at least 
1 — 2/m — 2e“^, we have, 


49,, „2 

U<^||x||, 


1 = 1 


< Corner^ 11®II 2, 


for some numerical constant Cq, provided > /c. In summary, 


^Vkr < C„^ ( v/(“* + fc-)logP 

\ m \\x \\2 \ m 


'•Ib + Arl/—+ (6.12) 


16 


® 2 V m 


provided m > C max(A: logp, k‘^^/logp). 
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Summary We can guarantee that, with probability at least 1 — ^ — 4e 
'\u — x\\r, { ^ \\z — x\ 


\X\\2 


< 1 - 


16 


® 2 


+ C'oM 


k log p a 


m 


\x 


12 ’ 


(6.13) 


for some absolute constant Cq > 0, provided m > Ck'^ log{mp) and p < po. ■ 

Suppose Eq is the intersection of the events Eqi and Eq 2 described by Lemmas 6.3 and 6.4, 
respectively. Then we have 

46 i. 

F{Eo) >1-lOe"^. 

m 

The following induction argument guarantees the effectiveness of thresholded Wirtinger flow: 

Lemma 6.5 Let (3 = 4: and x^'^\n = 0,1,2,... are defined iteratively by (2.10) and (2.4). For 
fixed n > 0, assume that there exists a random vector x^'^'i satisfying and supp(a;(”)) C 

S, and that on an event En C Eq we have x^^'i = and min — (—l)*a;||2 < g||a;||2. Then 

there exists a random vector satisfying x^'^~^^'> II Age and supp(a;(”+^)) C S, and on an 

event En+i C En satisfying F{En/En+i) < 1 — we have and 


mm \\x 
i=0,l 


pa 


(-+1) _ (_iya;||2 < niinp(-)-(-l)*®||2 + Co„ „ 

V 16/i=o,i" a; 2 


klogp 1.. .. 

- < « ® 2, 

m D 


provided m > C (1 + ) k"^ log(mp) for sufficiently large C. 


Proof The improved estimation is defined as 

g(n+l) ^ -7-^ 




,, (®'”> - ^V/(®<”l)) . 


where 7^ is the soft-thresholding operator. We now define 

^(n+i) ^ - ^V/(a^W), 

By the definition of V/, r and cf, as well as the assumption that ^("'^dLA^c and supp(a;(”')) C S', 
we can prove supp(a;(”''’'^)) C S as well as aj^’^+^^JhA^c. In fact, by the definition (2.3), we know 
if is supported on S and independent of Ag'c, then t{x^'^'>) is independent of Age. Moreover, 
by the definition of the gradient (2.2), we know (Vf{x^^^)^^ is supported on S and independent 
of A5C. The assertion is established by the obvious fact fi II Age shown in Lemma 6.1. 

In the following, we will construct £"^+1 C En such that = a;(^+i) on En+i- For any 

i = k + l,k + 2,...,p, with probability 1 - 




1 


m 


E 


< 


1=1 
y^4 log(mp) 




m 


\ 


(la/a^Wp 

1=1 


\apx 


(n)|2 


< r(®A)). 
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The first inequality is due to supp(£c("')) C S and and the second inequality is due to 

/3 = 4. Then with probability at least 1 — 


max 

k+l<i<p 






which implies 

(*<"> - |v/(*'"')) = Tj,,,,.., , 

Notice that on the event En, we have and hence 

Then there exists En+i C En, such that F{En/En+i) < and 


5 (n+l) ^ 






By the assumption, we have 


min(||®^"'^ “ ®lb, + ®||2) < “ll^lb onill„ 

6 

Since En C Eq and = r]{x^'^'^), by Lemma 6.4, we have 


mini \ \x 


-a:||2,||a:("+^^ T^lb) 


< (l - mindlai^”') - x\\2, + alb) + < ^||®lb on En, 

V 16/ ||a||2 V m 6 

provided m > (^((T^/llalb)/?logp for a sufficiently large absolute constant C. Since En+i C En, 
and = a^"'"''^) on En+i, we have 


min ||a(^+i) - (-l)*a||2 < fl - —) min Ha^*") - (-l)*a|b + < - 

i=o,i" ^ ^ 11^ - d 16/1=0,1" ^ ^ a: bV m “6 


-||a|b on En+i- 
6 


Theorem 3.1 can be directly implied by Lemma 6.5. In fact, by Lemma 6.3, we know the 
initial condition in 6.5 holds. For all t = 1, 2, 3,..., straight forward calculation yields 


min(||a*^*) — a|b, IIS*-*^ + aj|b) 1 


® b 




<X 1-:^ 


cr k log p 


on Et 


for some universal constant Cq, where P(Flt) >1-^ — lOe ^ — 


^-k t 


mp^ 
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A Preliminaries and supporting lemmas 

Lemma A.l ([5]) Suppose Xi,..., Xm are i.i.d. real-valued random variables obeying Xi <b for 
some absolute constant 6 > 0, E A* = 0 and E A? = v'^. Setting cr^ = m{b‘^ V v"^), 

P{Xi H-hXm > y} < exp co(l - ^{y/cr)) 

where one can take cq = 25. 


Lemma A.2 (Proposition 34 [43]) Suppose that x ~ J\f{0,ln) is a standard normal random 
vector, and f : M" —>■ M is a 1-Lipschitz function. Then 

F{f{x) — E f{x) > t) < e~^. 

Lemma A.3 (Proposition 33 [43]) Consider two centered Gaussian processes {Xt)t£T and {Yt)t£T 
whose increments satisfy the inequality 

W.\Xs-Xt\^ <¥.\Ys-Yt\^ 


for all s,t € T. Then 


E sup Af < E sup Yt. 

t&T t&T 


Lemma A.4 (Proposition 35 [43]) Let As G be defined in (6.1). Then, with probability at 

least 1 — 2exp(—1^/2), we have the following inequality 

\\As\\ < ^/m + Vk + t. (A-1) 


Lemma A.5 Let As G be defined in (6.1). Then, with probability at least 1 — 4exp(—1^/2), 

the following inequalities hold 


< (15m)^/® + \/fc + t, 


(A.2) 


and 


^5||2^-4 < (3m) + Vk + t. 


(A.3) 


Proof The proof follows that of Theorem 32 in [43] step by step. Define Xu,v = {Asu, v) on 
T = {(it, r») : It G M^,supp(t/) C S, ||it||2 = l,v e ||i’||6/5 = !}• 

Then ||A5||2^6 = P^SiX(^u,v)&T ^u,v Define 

y^v = {9s,u) + {h,v) 

where gs G with supp(g'5) = S and h G are independent standard Gaussian random 
vectors. 
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For any {u,v), {u\v') G T, we have 


IE \Xu,v - Xui,v'\ = \\v\\l + \\v'\\l - 2{u,u'){v,v') 

and 

IE \Yu,v - Yu',v'\ = 2 + ||^;||2 + ||^>'||2 - 2{u,u') - {v,v'). 


Therefore, 


¥.\Xu,v - Xu',v'\ - lElTu,^ - Yu>,v'\ = 2(1 - - {v,v')) > 0, 

due to ||w||2 = ||w'||2 = 1, ||t?||2 < ll'*^ll6/5 = ||i?'||2 < lli^^lle/s = 1- Then by Lemma A.3, we 

have 

E II A5||2^6 < IE max Yu,v = E Hgslb + E ||/i||6 < a/e ||sis||| + (E = \/k + 

{u,v)&T ’ 

Since || • ||2->-6 is a 1-Lipschitz function, by Lemma A.2, there holds with probability at least 
1 — 2 exp(—1^/2) 

IIA5||2^6 <'/k + (15m)^/® + t. 

Similarly, with probability at least 1 — 2exp(—1^/2) 

IIA5||2^4 <Vk + + t. 


Lemma A.6 On an event with probability at least 1 — 1/m, we have 


1 

m 


m 


E 


I %'s® I 


^%s%s 


\x 


{Ip)s + 2xx') 


< (5||£c 


provided m > C{5)klogk, where C{5) is constant only depending on 6. Here {Ip)s by defini¬ 
tion is a diagonal matrix with first k diagonal entries equal to 1, whereas other entries being 0. 
Furthermore, it implies that 


^ IIL 

— ^ 2{x'hf + (1 - 5)||a;|||||h||^ 

^ j=i 


for any h gMP that satisfies supp(/i) C S. 


The proof of this lemma is the same as that of Lemma 7.4 in [12] . 


Lemma A.7 Suppose ei,...,em are independent zero-mean sub-exponential random variables 
with 

a := max ||ei|Lj. 

l<i<m 
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Then with probability at least 1 — we have 


- m 

-T.‘i 

V m 

e oo < Codlogm, 

- m 

-E4 

< Coa^ 

and 

- m 

-E4 

i=i 


i=i 



i=i 


<Coa^ 


provided m > niQ for some numerical constants Cq and mQ. 
Proof By Proposition 16 in [43], we have 


i=l 


> t \ < 2 exp 


—cmin 


t 


9 ’ 

ma^ a 


This implies that with probability at least 1 — we have 


E' 

2=1 


< Cocrmax ^log m, log < C^a^Jm logm 


provided m > This implies that 


- m 

m 


i=i 


< CqCF 


logm 


m 


By the basic properties of sub-exponential random variables, for each j = 1,..., m, we have 

T (jejl '>t)< exp — c—^ , 

which implies that \ej\ < Coulogm with probability at least 1 — ejmf^. This implies that 

||e||oo < Cocrlogm 

with probability at least 1 — ejm^^. 


Since 


<7 > Ijejljii.^ = supp (E|ej|^)p, 

p>i 


we have Ee^ < (2 (t)^ and Ee^ < (4 ct)^. Define 


Then we have EX < (2cj)^, and 


By Chebyshev’s inequality, we have 


^ m 

v = -E4- 

j=i 


Var(X) < (4cr)"^/m. 


P(|X -EXj >t)< 


Var(X) 

' 
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By letting t = (4cj)^, we obtain that with probability at least 1 — 1/m, we have |X| < 20(t^. 


Similarly, with probability at least 1 — 1/m, we have ^ ^ 

constant Cg. 


< CocT^ for some absolute 


Lemma A.8 Suppose Zj G j = 1,... ,m are IID standard normal random vectors. For fixed 


a G 


with probability at least 1 — 2e , we have 


E 

i=i 




vi=i 


< Co (^k\\a\\l + k\\a\ 


for some absolute constant Cq. 


Proof Define 


By Lemma 4 in [43], we have 


A ^^ajZjZj 


i=i \i=i 


|A|| < 2 sup \x'Ax\, 

x^Ni 


where Ni is the 1 /4-net of the unit sphere T* 


■k-l 


For fixed x G Mi, let pj = \z'-x\‘^ — 1. Then 


X 


'Ax = '^ajyj. 
i=i 


Notice that yj, j = are IID sub-exponential variables with ||2/j||^i < K where K is 

an absolute constant. By Bernstein inequality (see, e.g., Proposition 16 in [43]), we have with 
probability at least 1 — 2exp(—4/c), 


i=i 


< 


(Co/2) (^k\\a\\l + k\\a\ 


for some absolute constant Cq. 


Since lA/"!! < 9^, we know with probability at least 1 — 2e we have 
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